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Abstract 

Canonical Correlation Analysis (CCA) is a 
widely used spectral technique for finding cor¬ 
relation structures in multi-view datasets. In 
this paper, we tackle the problem of large scale 
CCA, where classical algorithms, usually requir¬ 
ing computing the product of two huge matri¬ 
ces and huge matrix decomposition, are compu¬ 
tationally and storage expensive. We recast CCA 
from a novel perspective and propose a scalable 
and memory efficient Augmented Approximate 
Gradient (AppGrad) scheme for finding top k 
dimensional canonical subspace which only in¬ 
volves large matrix multiplying a thin matrix of 
width k and small matrix decomposition of di¬ 
mension k x k. Further, AppGrad achieves opti¬ 
mal storage complexity 0(k{p\ +P 2 )), compared 
with classical algorithms which usually require 
0(pi + P 2 ) space to store two dense whitening 
matrices. The proposed scheme naturally gen¬ 
eralizes to stochastic optimization regime, espe¬ 
cially efficient for huge datasets where batch al¬ 
gorithms are prohibitive. The online property 
of stochastic AppGrad is also well suited to the 
streaming scenario, where data comes sequen¬ 
tially. To the best of our knowledge, it is the 
first stochastic algorithm for CCA. Experiments 
on four real data sets are provided to show the 
effectiveness of the proposed methods. 

1. Introduction 

1.1. Background 

Canonical Correlation Analysis (CCA), first introduced in 
1936 by (Hotelling, 1936), is a foundamental statistical 
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tool to characterize the relationship between two multidi¬ 
mensional variables, which finds a wide range of appli¬ 
cations. For example, CCA naturally fits into multi-view 
learning tasks and tailored to generate low dimensional fea¬ 
ture representations using abandunt and inexpensive un¬ 
labeled datasets to supplement or refine the expensive la¬ 
beled data in a semi-supervised fashion. Improved gen¬ 
eralization accuracy has been witnessed or proved in ar¬ 
eas such as regression (Kakade & Foster, 2007), clustering 
(Chaudhuri et al., 2009; Blaschko & Lampert, 2008), di¬ 
mension reduction (Foster et al., 2008; McWilliams et al., 
2013), word embeddings (Dhillon et al., 2011; 2012), 
etc. Besides, CCA has also been succesfully applied 
to genome-wide association study (GWAS) and has been 
shown powerful for understanding the relationship between 
genetic variations and phenotypes (Witten et al., 2009; 
Chen et al., 2012). 

There are various equivalent ways to define CCA and here 
we use the linear algebraic formulation of (Golub & Zha, 
1995), which captures the very essense of the procedure, 
pursuing the directions of maximal correlations between 
two data matrices. 

Definition 1.1. For data matrices X 6 M" xpi . Y £ 
M nxp2 Let S x = X T X/n, S y = Y T Y /n, S xy = 
X T Y/n and p = min{pi ./p?}- The canonical correla¬ 
tions Ai, ■ ■ ■ , A p and corresponding pair of canonical vec¬ 
tors {{4>i, t/’*)}i=i between X and Y are defined recur¬ 
sively by 

= arg max ^ T S xy ip 

<£ T S x <£= 1,-0 SyV> = l 

cj) T S x 0i=O, ip T Syipi—O, — 1 

Xj = 4>J S xy rpj j = 

_ L _A -r- 

Lemma 1.1. Let S x 2 S xy S y 2 = UDV be the singular 

_ i _ i 

value decomposition. Then T* = S x 2 U, *F = S y 2 V, and 
A = D where $ = (cf> i, • • • , (j) p ), \F = (t/q, • • • , ip p ) and 
A = diag( Ai, • • • , X p ). 

The identifiability of canonical vectors (<F. fF) is equiva- 
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lent to the identifiability of the singular vectors (U,V). 

Lemma 1.1 implies that the leading k dimensional CCA 

subspace can be solved by first computing the whitening 
_1 _ 1 

matrices S x 2 , S y 2 and then perform a /.'-truncated SVD 

_ 1 _ 1 

on the whitened covariance matrix S x 2 S xy S y 2 . This 
classical algorithm is feasible and accurate when the data 
matrices are small but it can be slow and numerically un¬ 
stable for large scale datasets which are common in modern 
natural language processing (large corpora, Dhillon et al. 
(2011; 2012)) and multi-view learning (abandunt and inex¬ 
pensive unlabeled data, Hariharan & Subramanian (2014)) 
applications. 

Throughout the paper, we call the step of orthonormalizing 
the columns of X and Y whitening step. The computa¬ 
tional complexity of the classical algorithm is dominated 
by the whitening step. There are two major bottlenecks, 

• Huge matrix multiplication X T X,Y T Y to obtain 
S x , S y with computational complexity 0{np\ + np 2 ) 
for general dense X and Y. 

_ 1 

• Large matrix decomposition to compute S x 2 and 

_ 1 _ 

S y 2 with computational complexity 0(pf + p\) 
(Even when X and Y are sparse, S x , S y are not nec¬ 
essarily sparse) 

Remark 1.1. The whitening step dominates the k- 
truncated SVD step because the top k dimensional singular 
vectors can be efficiently computed by randomized SVD al¬ 
gorithms (see Halko et al. (2011 ) and many others). 
Remark 1.2. Another classical algorithm (built-in func¬ 
tion in Matlab) introduced in (Bjbrck & Golub, 1973) uses 
a different way of whitening. It first carrys out a QR de¬ 
composition, X = QajRa- and Y = Q y R y and then 
performs a SVD on Q J Q y , which has the same computa¬ 
tional complexity 0(np\ +np|) as the algorithm indicated 
by Lemma 1.1. However, it is difficult to exploit sparsity 
in QR factorization while X T X,Y T Y can be efficiently 
computed when X and Y are sparse. 

Besides computational issues, extra 0{p\ + p 2 ) space is 

_ l _i 

necessary to store two whitening matrices S x 2 and S y 2 
(typically dense). In high dimensional applications where 
the number of features is huge, this can be another bottle¬ 
neck considering the capacity of RAM of personal desktops 
(10-20 GB). In large distributed storage systems, the extra 
required space might incur heavy communication cost. 

Therefore, it is natural to ask: is there a scalable algo¬ 
rithm that avoids huge matrix decomposition and huge ma¬ 
trix multiplication? Is it memory efficient? Or even more 
ambitiously, is there an online algorithm that generates de¬ 
cent approximation given a fixed computational power (e.g. 
CPU time, FLOP)? 


1.2. Related Work 

Scalability begins to play an increasingly important role 
in modern machine learning applications and draws more 
and more attention. Recently lots of promising progress 
emerged in the literature concerning with randomized al¬ 
gorithms for large scale matrix approximations, SVD, and 
Principal Component Analysis (Sarlos, 2006; Liberty et al., 
2007; Woolfe et al., 2008; Halko et al., 2011). Unfortu¬ 
nately, these techniques does not directly solve CCA due 
to the whitening step. Several authors have tried to de¬ 
vise a scalable CCA algorithm. Avronetal. (2013) pro¬ 
posed an efficient approach for CCA between two tall and 
thin matrices (p \, p 2 -C n ) harnessing the recently de¬ 
veloped tools. Subsampled Randomized Hadamard Trans¬ 
form, which only subsampled a small proportion of the n 
data points to approximate the matrix product. However, 
when the size of the features, pi and p 2 , are large, the sam¬ 
pling scheme does not work. Later, Lu & Foster (2014) 
consider sparse design matrices and formulate CCA as iter¬ 
ative least squares, where in each iteration a fast regression 
algorithm that exploits sparsity is applied. 

Another related line of research considers stochastic 
optimization algorithms for PCA (Aroraetal., 2012; 
Mitliagkas et al., 2013; Balsubramani et al., 2013), which 
date back to Oja & Karhunen (1985). Compared with batch 
algorithms, the stochastic versions empirically converge 
much faster with similar accuracy. Further, these stochastic 
algorithms can be applied to streaming setting where data 
comes sequentially (one pass or several pass) without be¬ 
ing stored. As mentioned in (Arora et al., 2012), stochastic 
optimization algorithm for CCA is more challenging and 
remains an open problem because of the whitening step. 

1.3. Main Contribution 

The main contribution of this paper is to directly tackle 
CCA as a nonconvex optimization problem and propose 
a novel Augmented Approximate Gradient ( AppGrad ) 
scheme and its stochastic variant for finding the top k di¬ 
mensional canonical subspace. Its advantages over state- 
of-art CCA algorithms are three folds. Firstly, AppGrad 
scheme only involves large matrix multiplying a thin ma¬ 
trix of width k and small matrix decomposition of dimen¬ 
sion k x k, and therefore to some extent is free from the two 
bottlenecks. It also benefits if X and Y are sparse while 
classical algorithm still needs to invert the dense matrices 
X T X and Y T Y. Secondly, AppGrad achieves optimal 
storage complexity 0(k(p\ + P 2 )), the space necessary to 
store the output, compared with classical algorithms which 
usually require 0(p\ + p 2 ) for storing the whitening ma¬ 
trices. Thirdly, the stochastic (online) variant of AppGrad 
is especially efficient for large scale datasets if moderate 
accuracy is desired. It is well-suited to the case when com- 
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putational resources are limited or data comes as a stream. 
To the best of our knowledge, it is the first stochastic algo¬ 
rithm for CCA, which partly gives an affirmative answer to 
a question left open in (Arora et al., 2012). 

The rest of the paper is organized as follows. We introduce 
AppGrad scheme and establish its convergence properties 
in section 2. We extend the algorithm to stochastic settings 
in section 3. Extensive real data experiments are presented 
in section 4. Concluding remarks and future work are sum¬ 
marized in section 5. Proof of Theorem 2.1 and Proposi¬ 
tion 2.3 are relegated to the supplementary material. 

2. Algorithm 

For simplicity, we first focus on the leading canonical pair 
(</>i,'0i) to motivate the proposed algorithms. Results for 
general scenario can be obtained in the same manner and 
will be briefly discussed in the later part of this section. 

2.1. An Optimization Perspective 

Throughout the paper, we assume X and Y are of full rank. 
We use || • || for L 2 norm. Vu £ R Pl , v £ R P2 , we define 
IMU = (w T S x u)2 and ||u|| y = (u T S y u) 2 , which are 
norms induced by X and Y. 

To begin with, we recast CCA as an nonconvex optimiza¬ 
tion problem (Golub & Zha, 1995). 

Lemma 2.1. (<p\,ip\) is the solution of 

mm—WXf-YM 2 

2 n" ' 11 (1) 

subject to <j) T S x (f> = 1, tp T S y ip = 1 


Although (1) is a nonconvex (due to the nonconvex con¬ 
straint), (Golub & Zha, 1995) showed that an alternating 
minimization strategy (Algorithm 1), or rather iterative 
least squares, actually converges to the leading canoni¬ 
cal pair. However, each update cp t+1 = S x * 1 S xy '(/T is 
computationally intensive. Essentially, the alternating least 
squares acts like a second order method, which is usually 
recognized to be inefficient for large-scale datasets, espe¬ 
cially when current estimate is not close enough to the op¬ 
timum. Therefore, it is natural to ask: is there a valid first 
order method that solves (1)? Heuristics borrowed from 
convex optimization literature give rise to a projected gra¬ 
dient scheme summarized in Algorithm 2. Instead of com¬ 
pletely solving a least squares in each iterate, a single gra¬ 
dient step of ( 1 ) is performed and then project back to the 
constrained domain, which avoids inverting a huge matrix. 
Unfortunately, the following proposition demonstrates that 
Algorithm 2 fails to converge to the leading canonical pair. 


Algorithm 1 CCA via Alternating Least Squares 

Input: Data matrix X £ R" xpi , Y £ R " xp2 and ini¬ 
tialization (c f>° , ip°) 

Output :(<p AlS! V’a Ls) 

repeat 

4> t+1 = argnun^||X<j>- Y^*|| 2 = S x 1 S xy V’ t 

(p t+1 =f t+1 /H t+1 \\ x 

tp t+ 1 = argnun^HY^ - X<j>*|| 2 = S y 1 S yx <j) t 

tp t+1 = tp t+1 /\\ip t+1 \\ y 
until convergence 


Algorithm 2 CCA via Naive Gradient Descent 

Input: Data matrix X £ R™ xpi , Y £ R” xp2 , initializa¬ 
tion ((jp, tp°), step size 771 , 772 
Output : NAN (incorrect algorithm) 

repeat 

f t+1 = (jf - 7 ? iX T (X(j) t - Y ip*)/n 

<P t+1 = <p t+ 1 /\\<p t+1 \\ x 

ip t+ 1 =f) t - rj 2 Y T (Yip t - X0*)/n 

ft t+1 =fj t+1 /H t+1 \\ y 

until convergence 


Proposition 2.1. If leading canonical correlation Ai 1 
and either (pi is not an eigenvector of S x or ip\ is not an 
eigenvector of S y , then Vr/i . 7)2 > 0, the leading canoni¬ 
cal pair ip 1 ) is not a fixed point of the naive gradient 
scheme in Algorithm 2. Therefore, the algorithm does not 
converge to 

Proof of Proposition 2.1. The proof is similar to the proof 
of Proposition 2.2 and we leave out the details here. □ 

The failure of Algorithm 2 is due to the nonconvex na¬ 
ture of (1). Although every gradient step might decrease 
the objective function, this property no longer persists af¬ 
ter projecting to its nonconvex domain {(<^, ip) \ <p T S x <p = 

1, ip'Syip = 1} (the normalization step). On the contrary, 
decreases triggered by gradient descent is always main¬ 
tained if projecting to a convex region. 

2.2. AppGrad Scheme 

As a remedy, we propose a novel Augmented Approximate 
Gradient (, AppGrad) scheme summarized in Algorithm 3. 
It inherits the convergence guarantee of alternating least 
squares as well as the scalability and memory efficiency 
of first order methods, which only involves matrix-vector 
multiplication and only requires 0{p\ + P 2 ) extra space. 

AppGrad seems unnatural at first sight but has some nice 
intuitions behind as we will discuss later. The differences 
and similarities between these algorithms are subtle but 
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Algorithm 3 CCA via AppGrad 

Input: Data matrix X £ M. nxpi , Y £ R" xp2 , initializa¬ 
tion {(p°, ip°, (p°, ■0°), step size r/i,r )2 
Output: {(p ag , ip ag j <Pag j t^AG) 

repeat 

<j) t+1 =f~ r/iX^X^ ~ Y ip*)/n 

0 t+1 = <^ + 7 ll^ t+1 IU _ 

t/> t+1 = _ r/2 Y T (YV’ t - X0*)/n 

</> t+1 =^+i/||^ t + 1 || y 

until convergence 


crucial. Compared with the naive gradient descent, we in¬ 
troduce two auxiliary variables (c p*,ip*), an unnormalized 
version of {(p*, ip*). During each iterate, we keep updat¬ 
ing (p* and ip* without scaling them to have unit norm, 
which in turn produces the ‘correct’ normalized counter¬ 
part, {(p*,ip*). It turns out that {(pi, ipi, Xifii, Xiipi) is a 
fixed point of the dynamic system {{(p*, ip*, (p*, ip*)} < fL 0 . 

Proposition 2.2. V* < p, let (pi = Xi(pi,ipi = A iipi, then 
{(pi, ipi, (pi, ipi) are the fixed points of AppGrad scheme. 

To prove the proposition, we need the following lemma that 
characterizes the relations among some key quantities. 

Lemma 2.2. S xy = S x $A’4' T S y 

Proof of Lemma 2.2. By Lemma 1.1, S x 2 S xy S y 2 = 
UDV t , where U = Sl$, V = SJ 'P and D = A. Then 
we have S xy = SlUDV T s| = S x $A^ T S y . □ 

Proof of Proposition 2.2. Substitute {(p*, ip 1 , (p*, ip*) = 
{(pi, ipi, (pi, ipi) into the iterative formula in Algorithm 3. 

4> t+1 =<j>i~ t?i(S x ^i - S xy ipi) 

= 4>i- T7i(S J>i - S x $A* T S y 0;) 

= 02 ^l(S X 02 ^iS x 0j) 

= 02 

The second equality is direct application of Lemma 2.2. 
The third equality is due to the fact that \P T S y 'I' = I p . 
Then, 

(p t+1 = (pi/\\(pi\\x = (pi/Xi = (pi 

Therefore {(p t+1 , (p t+1 ) = {(p*,(p*) = {(pi,(pf). Asymmet¬ 
ric argument will show that {ip* +1 , ip t+1 ) = {ip*, ip*) = 
{ipi, ipi), which completes the proof. □ 

The connection between AppGrad and alternating min¬ 
imization strategy is not instaneous. Intuitively, when 
{(p*,ip*) is not close to {(pi,ipi), solving the least squares 
completely as carried out in Algorithm 1 is a waste of com¬ 
putational power (informally by regarding it as a second 


order method, the Newton Step has fast convergence only 
when current estimate is close to the optimum). Instead 
of solving a sequence of possibly unrelevant least squares, 
the following lemma shows that AppGrad directly targets 
at the least squares that involves the leading canonical pair. 

Lemma 2.3. Let {(pi,ipi) be the leading canonical pair 
and {(pi,ipi) = Xi{(pi,ipi). Then, 


(pi = argmin -^-||X(/> — Yi/q|| 2 
^ An 

Lpi = arg min -^-||Yt/> - X</>i|| 2 
ip 2 n 


( 2 ) 


Proof of Lemma 2.3. Let (p* = argmin J-|| X0— Y^iH 2 , 

4, n 

by optimality condition, S x (p* = S xy Li. Apply 
Lemma 2.2, 

cp* = S x - 1 S x $A^ T S y V>i = Xifii = fix 
Similar argument gives ip* = ipi □ 

Lemma 2.3 characterizes the relationship between lead¬ 
ing canonical pair {(p\,ipi) and its unnormalized counter¬ 
part {(pi,ipi), which sheds some insight on how AppGrad 
works. The intuition is that {<p*,ip*) and (0*, "0*) are cur- 
rent estimations of {(pi,ipi) and {(pi,ipi), and the updates 
of {(p t+1 , ip* +1 ) in Algorithm 3 are actually gradient steps 
of the least squares in (2), with the unknown truth {(pi, ipi) 
approximated by (c p*, ip*). In terms of mathematics, 

4> t+1 =fi t - mX T {Xf* - Yip t )/n 

^^ t -7 ?1 X T (X^-Y^i )/n (3) 

The normalization step in Algorithm 3 corresponds 
to generating new approximations of {<p\,ipi), namely 
{(p t+1 , ip t+1 ), using the updated (<p t+1 , ip t+1 ) through the 
relationship {(pi,ipi) = {(pi/\\(pi\\ x , ipi/\\ipi\\v). There¬ 
fore, one can interpret AppGrad as approximate gradi¬ 
ent scheme for solving (2). When {(p*,ip*) converge to 
{(pi,ipi), its scaled version (</>*, ip*) converge to the lead¬ 
ing canonical pair {(pi,ipi). 

The following theorem shows that when the estimates enter 
a neighborhood of the true canonical pair, AppGrad is con¬ 
tractive. Define the error metric et = ||A<^|| 2 + ||A^> t || 2 
where A<p* = <p* — <pi, Aip* = ip* — ipi. 

Theorem 2.1. Assume Ai > A 2 , 3Li,L2 > 1 such that 
Amaa:(Sx) , X max (S y ) < Li and X m i n (S x ), Xmi n (S y ) A 
Lf , where X m i n {-), X max {-) denote smallest and largest 
eigenvalues. If eo < 2(A 2 — X^)/Li and set rji = r ]2 = 
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Algorithm 4 CCA via AppGrad (Rank-A:) 

Input: Datamatrix^X £ R nxpi , Y £ R" xp2 , initializa¬ 
tion (4>°, 4/°, 4>°, 4/°), step size 771 , 772 
Output: (3 >ac, 'S'ao, ^ag, ^ag) 

repeat 

$*+i =J> t - ?yiXj(X$ t - Y^‘)/n 
SVD: (# i+ 1 ) T S x $ i+1 = Ua-D^Uj 
$*+i = $ t+ 1 U :r D^^Uj 
4> t+1 =jj>* _ r?2 Y T (Y§ t - X4>*)/n 
SVD: (vp t+ 1 ) T S y \l > t+1 = UyDylJT 

q,t+i = yt+i 

until convergence 


?7 = S/ 6 L 1 , then AppGrad achieves linear convergence 
such thatM t £ N+ 

/ S 2 \* 

e ‘- i 1 - eijrJ 

where <5 = 1 — (l — ^ " ) > 0 

Remark 2.1. The theorem reveals that the larger is the 
eigengap Ai — A 2 , the broader is the contraction region. We 
didn ’t try to optimize the conditions above and empirically 
as shown in the experiments, a randomized initialization 
always suffices to capture most of the correlations. 

2.3. General Rank-/ Case 

Following the spirit of rank-one case, AppGrad can be eas¬ 
ily generalized to compute the top k dimesional canonical 
subspace as summarized in Algorithm 4. The only differ¬ 
ence is that the original scalar normalization is replaced by 
its matrix counterpart, that is to multiply the inverse of the 

square root matrix 4> i+1 = <fT +1 U ;/ ;D :T: 2 U., 1 ., ensuring 
that ($ t+ 1 ) T X T X $ t+1 = I fc . 

Notice that the gradient step only involves a large matrix 
multiplying a thin matrix of width k and the SVD is per¬ 
formed on a small k x k matrix. Therefore, the computa¬ 
tional complexity per iteration is dominated by the gradient 
step, of order 0(n(pi + P 2 )k). The cost will be further re¬ 
duced when the data matrices X, Y are sparse. 

Compared with classical spectral agorithm which first 
whitens the data matrices and then performs a SVD on 
the whitened covariance matrix, AppGrad actually merges 
these two steps together. This is the key of its efficiency. In 
a high level, whitening the whole data matrix is not neces¬ 
sary and we only want to whiten the directions that contain 
the leading CCA subspace. However, these directions are 
unknown and therefore for two-step procedures, whitening 
the whole data matrix is unavoidable. Instead, AppGrad 
tries to identify (gradient step) and whiten (normalization 


step) these directions simultaneously. In this way, every 
normalization step is only performed on the potential k di¬ 
mensional target CCA subspace and therefore only deals 
with a small k x k matrix. 

Parallel results of Lemma 2.1, Proposition 2.1, Proposi¬ 
tion 2.2, Lemma 2.3 for this general scenario can be es¬ 
tablished in a similar manner. Here, to make Algorithm 4 
more clear, we state the fixed point result of which the proof 
is similar to Proposition 2.2. 

Proposition 2.3. Let Ay = dia<j{X \, • • • , Xy) be the diag¬ 
onal matrix of top k canonical correlations and let 4?/. = 
(</> 1 , • • • , f>k), ’3 ’k = (</>i, • • • , befhe top k CCA vec¬ 

tors. Also denote f Py = 4*/.A/< and y = rp yAy. Then 
for any k x k orthogonal matrix Q, \Pfc, 4>fc, H r fc)Q 
is a fixed point of AppGrad scheme. 

The top k dimensional canonical subspace is identifiable up 
to a rotation matrix and Proposition 2.3 shows that every 
optimum is a fixed point of AppGrad scheme. 

2.4. Kernelization 

Sometimes CCA is restricted because of its linearity and 
kernel CCA offers an alternative by projecting data into a 
high dimensional feature space. In this section, we show 
that AppGrad works for kernel CCA as well. Let K x (■. •) 
and Kyf,-) be Mercer kernels, then there exists feature 
mappings fx : X — > Tx and fy : y —y Ty such 
that K x (xi,Xj) = {fx(xi),fx(xj)) and Kyipi^j) = 
(.fy{Vi), fyiVj))- Let F x = (f x (x 1 ), • • • , f x (x n )) r and 
= (fy{yi)> -I fy{yn)) T be the compact representa¬ 
tion of the objects in the possibly infinite dimensional fea¬ 
ture space. Since the top k dimensional canonical vectors 
lie in the space spaned by the features, say ( l>y = F T X W X 
and V k = FjWy for some W x ,Wy £ K nxfe . Let 
K x = F x Fj., Ky = FyFj be the Gram matrices. Sim¬ 
ilar to Lemma 2.1, kernel CCA can be formulated as 

arg max \\K X W X - K y W y \\ 2 F 

Wa',Wjj 

subject to WjK*K*W* = l k WjK^K^Wy, = I k 

Following the same logic as Proposition 2.3, a similar fixed 
point result can be proved. Therefore, Algorithm 4 can be 
directly applied to compute W^jW^by simply replacing 
X, Y with K x ,Ky. 

3. Stochastic AppGrad 

Recently, there is a growing interest in stochastic op¬ 
timization which is shown to have better performance 
for large-scale learning problems (Bousquet & Bottou, 
2008; Bottou, 2010). Especially in the so-called ‘data 
laden regime’, where data is abundant and the bottle¬ 
neck is runtime, stochastic optimization dominate batch 
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Algorithm 5 CCA via Stochastic AppGrad (Rank-fc) 
Input: Datamatrix^X £ R nxpi , Y £ R" xp2 , initializa¬ 
tion (<&°, 4> 0 , \P°), step size r) it, r) 2 t, minibatch size 

m 

Output: (fI?SAG ; ^SAG, ^sag, ^sag) 

repeat 

Randomly pick a subset I C {1,2, • • • , n} of size m 
= jpt _ riu xJ(X I $ / - Yj$ f )/m 
SVD: ($ t+1 ) T (iXjXi)# t + 1 = UjD.U, 

<j > t+1 = 

§‘+1 =Jj>* - T] 2t Yj(Y x & - Xi^ 4 )/m 
SVD: (^+i)T(i.YjYx)^ t+1 = UjD„U„ 

* t+ 1 = ^ t+1 Uj[D^U y 
until convergence 


algorithms both empirically and theoretically. Given 
these advantages, lots of efforts have been spent on 
developing stochastic algorithms for principal compo¬ 
nent analysis (Oja & Karhunen, 1985; Aroraetal., 2012; 
Mitliagkas et al., 2013; Balsubramani et ah, 2013). De¬ 
spite promising progress in PCA, as mentioned in 
(Aroraetal., 2012), stochastic CCA is more challenging 
and remains an open problem due to the whitening step. 

As a gradient scheme, AppGrad naturally generalizes to the 
stochastic regime and we summarize in Algorithm 5. Com¬ 
pared with the batch version, only a small subset of sam¬ 
ples are used to compute the gradient, which reduces the 
computational cost per iteration from 0 (n(pi + P 2 )k) to 
0 (m(pi + P 2 )k) (m = \I\ is the size of the minibatch). 
Empirically, this makes stochastic AppGrad much faster 
than the batch version as we will see in the experiments. 
Also, for large scale applications when fully calculating the 
CCA subspace is prohibitive, stochastic AppGrad can gen¬ 
erate a decent approximation given a fixed computational 
power, while other algorithms only give a one-shot esti¬ 
mate after the whole procedure is carried out completely. 
Moreover, when there is a generative model, as shown in 
(Bousquet & Bottou, 2008), due to the tradeoff between 
statistical and numerical accuracy, fully solving an empir¬ 
ical risk minimization is unnecessary since the statistical 
error will finally dominate. On the contrary, stochastic op¬ 
timization directly tackles the problem in the population 
level and therefore is more statistically efficient. 

It is worth mentioning that the normalization step is ac¬ 
complished using a sampled Gram matrix A-XjXj and 
—YjYx- A key observation is that when m £ 0(k), 
($‘ +1 ) T (iXjXx)$ t+1 « ($‘ +1 ) T (iX T X)$‘ +1 us¬ 
ing standard concentration inequality, because the matrix 
we want to approximate (3> t+1 ) T (-I-X T X)‘I> t+1 is a kxk 
matrix, while generally 0 (p) sample is needed to have 


—XjXx « -X T X. As we have argued in previous sec¬ 
tion, this bonus is a byproduct of the fact that AppGrad 
tries to identify and whiten the directions that contains the 
CCA subspace simultaneously, or else 0(p) samples are 
necessary for whitening the whole data matrices. 

4. Experiments 

In this section, we present experiments on four real datasets 
to evaluate the effectiveness of the proposed algorithms for 
computing the top 20 (k= 20) dimensional canonical sub¬ 
space. A short summary of the datasets is in Table 1. 

Mediamill is an annotated video dataset from the Medi- 
amill Challenge (Snoek et al„ 2006). Each image is a rep¬ 
resentative keyframe of a video shot annotated with 101 
labels and consists of 120 features. CCA is performed to 
explore the correlation structure between the images and its 
labels. 

MNIST is a database of handwritten digits. CCA is used to 
learn correlated representations between the left and right 
halves of the images. 

Penn Tree Bank dataset is extracted from Wall Street Jour¬ 
nal, which consists of 1.17 million tokens and a vocabu¬ 
lary size of 43,000 (Lamar et ah, 2010). CCA has been 
successfully used on this dataset to build low dimensional 
word embeddings (Dhillon et ah, 2011; 2012). The task 
here is a CCA between words and their context. We only 
consider the 10, 000 most frequent words to avoid sample 
sparsity. 

URL Reputation dataset (Ma et ah, 2009) is extracted 
from UCI machine learning repository. The dataset con¬ 
tains 2.4 million URLs each represented by 3.2 million 
features. For simplicity we only use the first 2 million 
samples. 38% of the features are host based features like 
WHOIS info, IP prefix and 62% are lexical based features 
like Hostname and Primary domain. We run a CCA be¬ 
tween a subset of host based features and a subset of lexical 
based features. 

4.1. Implementations 

Evaluation Criterion: The evaluation criterion we use 
for the first three datasets (Mediamill, MNIST, Penn Tree 
Bank) is Proportions of Correlations Captured (PCC). To 
introduce this term, we first define Total Correlations Cap¬ 
tured (TCC) between two matrices to be the sum of their 
canonical correlations as defined in Lemma 1.1. Then, for 
estimated top k dimensional canonical subspace 
and true leading k dimensional CCA subspace <b/, . 

PCC is defined as 

TCC(X$fc, Y$fc) 

TCC(X4?fc, Y\Pfc) 
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Table 1. Brief Summary of Datasets 


Datasets 

Description 

Pi 

P2 

n 

Mediamill 

Mnist 

Penn Tree Bank 

URL Reputation 

Image and its labels 

LEFT AND RIGHT HALVES OF IMAGES 

Word Co-ocurrance 

Host and lexical based features 

100 

392 

10,000 

100,000 

120 

392 

10,000 

100,000 

30,000 

60,000 

500,000 

1,000,000 


Intuitively PCC characterizes the proportion of correlations 
captured by certain algorithm compared with the true CCA 
subspace. Therefore, the higher is PCC the better is the 
estimated CCA subspace. This criterion has two major ad¬ 
vantages over subspace distance | — P$> k || ( Pq is pro¬ 

jection matrix of the column space of U). First, it is more 
natural and relevant considering that the goal of CCA is to 
capture most correlations between two data matrices. Sec¬ 
ond, when the eigengap A A = A*,— Afc+i is not big enough, 
the top k dimensional CCA subspace is ill posed while the 
correlations captured is well defined. 

We use the output of the standard spectral algorithms as the 
truth ('<!>/,.. T'fc) to calculate the denominator of PCC. How¬ 
ever, for URL Reputation dataset, the number of samples 
and features are too large for the algorithm to compute the 
true CCA subspace in a reasonable amount of time and in¬ 
stead we only compare the numerator TCC(X4?fc, YSP^.) 
(monotone w.r.t. PCC) for different algorithms. 

Initialization We initialize \P°) by first drawing i.i.d. 
samples from standard Gaussian distribution and then nor¬ 
malize such that (<J?°) T S x <f> 0 = Ik and (\P 0 ) T S y \I> 0 = Ik 

Stepsize For both AppGrad and stochastic AppGrad, a 
small part of the training set is held out and cross-validation 
is used to choose the step size adaptively. 

Regularization For all the algorithms, a little regulariza¬ 
tion is added for numerical stability which means we re¬ 
place Gram matrix X T X with X T X + AI for some small 
positive A. 

Oversampling Oversampling means when aiming for top 
k dimensional subspace, people usually computes top k +1 
dimesional subspace from which a best k diemsional sub¬ 
space is extracted. In practice, l = 5 ~ 10 suffices to 
improve the performance. We only do a oversampling of 5 
in the URL dataset. 

4.2. Summary of Results 

For the first three datasets (Mediamill, MNIST, Penn Tree 
Bank), both in-sample and out-of-sample PCC are com¬ 
puted for AppGrad and Stochastic AppGrad as summarized 
in Figure 1. As you can see, both algorithms nearly capture 
most of the correlations compared with the true CCA sub¬ 
space and stochastic AppGrad consistently achieves same 


PCC with much less computational cost than its batch ver¬ 
sion. Moreover, the larger is the size of the data, the bigger 
advantage will stochastic AppGrad obtain. One thing to no¬ 
tice is that, as revealed in Mediamill dataset, out-of-sample 
PCC is not necessarily less than in-sample PCC because 
both denominator and numerator will change on the hold 
out set. 

For URL Reputation dataset, as we mentioned earlier, clas¬ 
sical algorithms fails on a typical desktop. The reason is 
that these algorithms only produce a one-shot estimate af¬ 
ter the whole procedure is completed, which is usually pro¬ 
hibitive for huge datasets. In this scenario, the advantage 
of online algorithms like stochastic AppGrad becomes cru¬ 
cial. Further, the stochastic nature makes the algorithm 
cost-effective and generate decent approximations given 
fixed computational resources (e.g. FLOP). As revealed by 
Figure 2, as the number of iterations increases, stochastic 
AppGrad captures more and more correlations. 

Since the true CCA subspaces for URL dataset is too slow 
to compute, we compare our algorithm with some naive 
heuristics which can be carried out efficiently in large scale 
and catches a reasonable amount of correlation. Below is a 
brief description of them. 

• Non-Whitening (NW-CCA): directly perform S VD on 
the unwhitened covariance matrix X 3 Y. This strat¬ 
egy is also used in (Witten et ah, 2009) 

• Diagnoally Whitening (DW-CCA) (Lu & Foster, 
2014): avoid inverting matrices by approximating 
S x 2 ,S y 2 with (diag(S x )) — i and (diag(S y ))“5. 

• Whitening the leading m Principal Component Direc¬ 
tions (PCA-CCA): First compute the leading m di¬ 
mensional principal component subspace and project 
the data matrices X and Y to the subspace, denote 
them XJ x and TJ y . Then compute the top k dimen¬ 
sional CCA subspace of the pair (U^, Uy). At last, 
transform the CCA subspace of (U^Uy) back to 
the CCA subspace of orginal matrix pair (X, Y). 
Specifically for this example, we choose m = 1200 
(log(FLOP)=35, dominating the computational cost of 
Stochastic AppGrad). 

Remark 4.1. For all the heuristics mentioned above, SVD 
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Mediamill AppGrad 



Mediamill S-AppGrad 


Mnist AppGrad 



Mnist S-AppGrad 


PTB AppGrad 



PTB S-AppGrad 





Figure 1. Proportion of Correlations Captured (PCC) by AppGrad and stochastic AppGrad on different datasets 


URL Outsample 



Figure 2. Total Correlations Captured (TCC) by NW-CCA, DW- 
CCA, PCA-CCA and stochastic AppGrad on URL dataset. The 
dash lines indicate TCC for those heuristics and the colored 
dots denote corresponding computational cost. Red arrow means 
log(FLOP) of PCA-CCA is more than 33. 

and PCA steps are carried out using the randomized algo¬ 
rithms in (Halko et al., 2011). For PCA-CCA, as the num¬ 
ber of Principal Components (m) increases, more corre¬ 
lation will be captured but the computational cost will also 
increase. When m = p, PCA-CCA is reduced to the orginal 
CCA. 

Essentially, all the heuristics are incorrect algorithms and 
try to approximately whiten the data matrices. As sug¬ 


gested by Figure 2, stochastic AppGrad significantly cap¬ 
tures much more correlations. 

5. Conclusions and Future Work 

In this paper, we present a novel first order method, App¬ 
Grad, to tackle large scale CCA as a nonconvex optimiza¬ 
tion problem. This bottleneck-free algorithm is both mem¬ 
ory efficient and computationally scalable. More impor¬ 
tantly, its online variant is well-suited to practical high 
dimensional applications where batch algorithm is pro¬ 
hibitive and data laden regime where data is abundant and 
runtime is main concern. 

Further, AppGrad is flexible and structure information 
can be easily incorporated into the algorithm. For ex¬ 
ample, if the canonical vectors are assumed to be sparse 
(Witten et al., 2009; Gao et al., 2014), a thresholding step 
can be added between the gradient step and normalization 
step to obtain sparse solutions while it is hard to add sparse 
constraint to the classical CCA formulation which is a gen¬ 
eralized eigenvalue problem. Heuristics in (Witten et al., 
2009) avoid this by simply skipping the whitening proce¬ 
dure (NW-CCA). (Gao et al., 2014) resorts to semidefinite 
programming and therefore is very slow. AppGrad with 
thresholding works well in simulations and we leave its the¬ 
oretical properties for future research. 
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Supplementary Material: Finding Linear Structure 
in Large Datasets with Scalable Canonical Correlation Analysis 


6. Proofs 

6.1. Preliminaries 

A brief review of the notations in the main paper: 

S x = X T X/n, S xy = X t Y /n, S y = Y T Y/n, \\u\\ x = (u T S x u)3, \\v\\ y = ( v T S y vft 
Aft = ft — ft, Aft = ft — ft, Aft = ft — ft, Aft = ft — ft 


Further, we define cos x (u,v) = [j^jjppjp the cosine of the angle between two vectors induced by the inner product 

(u, v) = u T S x v. Similarly, we define cos y (u, v) = ,-Aj SvV 
lemma. 


-. To prove the theorem, we will repeatedly use the following 


Lemma 6.1. ||A^|| X < 1+cos ^fl>J A ^1U and\\Aft\\ y < 1+COSy \ rM \\ Aft\\ y 

Proof of Lemma 6.1 Notice that cosftft, ft) = cosftft, ft), then 

M\l = U* - ft\\l > WftW 2 sinl(ft,ft) = A \sinl{ft,ft) 

Also notice that \\ft\\ x = ||(^i||a; = 1, which implies cos x (ft, ff) = 1 — — </>i||^/2 = 1 — || A</> t ||^/2. Further 

\\Aft\\l > \\sinl(ft,ft) = \\{l - cosl{ft,ft)) = ^\\Aft\\ 2 x (l + cosftft,ft)) 


Square root both sides, 


Similar argument will show that 


6.2. Proof of Theorem 2.1 


1 


l|A^|U<- 


Ai V 1 + cos x (ft,ft) 


MU 


i 


l|A^||y< T- 


Al V 1 + COSy(ft,ft) 


MU 


Without loss of generality, we can always assume cosftft, ft), cos y (ft, ft) > 0 because the canonical vectors are only 
identifiable up to a flip in sign and we can always choose ft, ft such that the cosines are nonnegative. Apply simple 
algebra to the gradient step ft +1 = ft — rj(S x ft — S xy ft), we have 

ft +1 ~ ft = ft - ft - ft Sftft - ft) + S X ft ~ S X y(ft - ft) - S X yft) 

Aft +1 = Aft - ftS x Aft - S xy Aft) - T)(S x ft - S xy ft) 

By Lemma 2.2, ftS x ft — S xy f)i) = r](S x ft — AiS x ft) = 0, which implies 

It+i _ 


Aft +L = Aft - ftS x Aft - S xy Aft) 


Square both sizes. 


\\Aft +1 ft = \\Aft\\ 2 + ft\\S x Aft - S xy A^|| 2 - 2ftAft ) 1 (S x Aft - S xy Aft) 


(4) 
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Apply Lemma 2.2, 


|SxyA^|| = 


|S x d>A* T S y A^|| < ||SJ||||S^||||A||||<& t S|||||S = A^|| < AiLf \\A^% 


The last inequality uses the assumption that A max (S x ), A max (S y ) < L\. By Lemma6.1, || < T 7 ll A V )t || ! /- Hence, 

||S xy Ai/) t || < v / 2iT|| A-0*||Also notice that IISxA^II < ||S| ||||S| A<^>*|| < LI ||A</> 4 || X , then 

||S x A^-S xy A^|| 2 <(4l|A^|U + V 24 ||A^y 2 < 2 L 1 (||A 0 t ||^ + 2||A^||^) 

Substitute into (4), 

||A^ +1 || 2 < ||A^|| 2 - 2^11 A0‘|| 2 + 2L 1?7 2 (||A^|| 2 + 2||A^|| 2 ) + 2r ? (A^ t ) T S xy A^ (5) 

— 1 1 
Now, we are going to bound (Aft^SxyAi/j*. Because S y 'I' is an orthonormal matrix (orthogonal if p = p±) and S y ipt 

1 1 

is a unit vector, there exisit coefficients ai,--- ,a p ,a± and unit vector ip± £ ColSpan( S y \E')- L such that S y ipt = 
Y7i =i + Oi±S§i/)±, Ya=i a 2 + a\ = 1. Therefore, 


(A</>*) T S x ‘f>A\f' T S y A^ t = A</) t S x $A(s|^) T {(a 1 - l)s|v»i + £o<s| + a ± S§ip ± } 

i=2 
P 

= A 1 (a 1 - l)(Ac/) t ) T S x (/) 1 + ^ a l Ai(A(/) t ) T S x ^ i 

2=2 


By Cauthy-Schwarz inequality, 

( A ^) T S x *A* T S y A^ < (a 2 (1 - ai ) 2 + a 2 A 2 ) 5 ( £ ((A^) T S X ^) 2 ) ! 


1 , P 


< (a?( 1 - «r) 2 + ^(1 - a?)) ’ II A0‘|| a 


= (A? TT ^ + A 2 )" (1 -a 2) ^| |A 0 t || a 

\\± 1 >X 


By definition, 1 — ai = 1 — cosypp* ,ip\) = "y "" . Further by Lemma 6.1, 


rilin' 


A 1 (l + oil) 


Therefore, 


(A</> t ) T S x $A^ T S y A^ < ( 

( 


1 — a\ Ao N ” 

IT^ + a! 


)"||A^|U||A^|| 4 


|| A-iA* || 2 

^5^ +^'11^11,11^*11, 


< 


< 


Ai' 4 


WllA^H 2 A 2 y 
2 V A 2 + A 2 


IIA^II; 


Substitute into (5), 

||A^ +1 || 2 < || A^|| 2 - 2t ? ||A0*|| 2 + 2 L lV 2 (||A0‘|| 2 + 2||A^|| 2 ) + ’ (Wlli + || A ^|| 2 ) 

Similar analysis implies that, 

|| A ^ +1 || 2 < || A ^|| 2 - 2r,\\A^% + 2 L lV 2 (||A^|| 2 + 2 ||A^|| 2 ) + * (Wl l + II A ^l| 2 ) 
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Add these two inequalities, 


I|a^+t + I|a^ + T < 


3L lV }(\\A^\\l + \\A^\\l) 


Notice that sfa + \fb < \j2(a + b), we have 


(WX 

, a|\* , 

r||A ^|| 2 

, A|\ 

V A? 

+ A?) +l 

^ A? 

+ A?) 


< 


Then, 


I|a^t + ||a^T< 


2\\Ani , 4 a|ni 

V A 2 + X\ + XfJ 

/2i 1 ||A^ t || 2 2L 1 ||A^|| 2 4A 2 \ \ 

V a 2 + xl + x \) 
±.^\\A^\\ 2 + ^\\A^r + ^)" 


|| 2 + ||A^|| 2 ) - 277(1 - — (^||A^|| 2 + ^||A^|| 2 + A 2 ) ^ 
3L^}(||A^|| 2 + ||A^|| 2 ) 


1 

By definition, S = 1 — -jA- || A^°|| 2 + k±\\ A</>°|| 2 + X^j and rf = Substitute in (6) with t = 0, 


||A0 1 || 2 + ||A-0 1 || 2 = 


0 II 2 


+ \\Ar\\ 2 )-^{\\An 2 x +\\^x 

(||A^|| 2 + ||A/|| 2 ) - ^-(||A^|| 2 + ||A^|| 2 


< 


6 2 


"(‘-61^)(llA”ll 2 + l|i^T) 


It follows by induction that V t £ N+ 

||A^ +1 || 2 + ||A^ +1 || 2 


< 1 - 


6 2 


6 L 1 L 2 


)(||A^|| 2 + ||A#|p 


( 6 ) 


6.3. Proof of Proposition 2.3 

Substitute (<!>' . 'P' , d?'. 'I'*) = ($ fc , \P k . d ? k A k , T * k A k )Q into the iterative formula in Algorithm 4. 

& +1 = ^fcAfcQ - t?i(S x tPfcAfc - S xy \Pj;)Q 

= <4> k A k Q - Vl (S^ k A k - S x $Ad/ T S y d/fe)Q 
^fcA^Q (S x d?k A^ S x $ fc A fc )Q 

= $feA^Q 

The second equality is direct application of Lemma 1. The third equality is due to the fact that \P T S y \P = I p . Then, 


($ 1+1 ) T S x $ f+1 = Q t A£Q 


and 

dh*+i = $‘+ 1 Q t A^: 1 Q = $ fc Q 

Therefore (d> t+1 , d> t+1 ) = ($*,$*) = (d> fc) d>/,,Afc)Q. A symmetric argument will show that (d> t+1 , \p t+1 ) = 
(vpf qyt) = (d>fe, \PfcAfc)Q, which completes the proof. 















